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Abstract 

We propose a new prior for ultra-sparse signal detection that we term the "horseshoe+ prior." 

The horseshoe+ prior is a natural extension of the horseshoe prior that has achieved success in 
the estimation and detection of sparse signals and has been shown to possess a number of de¬ 
sirable theoretical properties while enjoying computational feasibility in high dimensions. The 
horseshoe+ prior builds upon these advantages. Our work proves that the horseshoe+ poste¬ 
rior concentrates at a rate faster than that of the horseshoe in the Kullback-Leibler (K-L) sense. 

We also establish theoretically that the proposed estimator has lower posterior mean squared 
error in estimating signals compared to the horseshoe and achieves the optimal Bayes risk in 
testing up to a constant. For global-local scale mixture priors, we develop a new technique 
for analyzing the marginal sparse prior densities using the class of Meijer-G functions. In sim¬ 
ulations, the horseshoe+ estimator demonstrates superior performance in a standard design 
setting against competing methods, including the horseshoe and Dirichlet-Laplace estimators. 

We conclude with an illustration on a prostate cancer data set and by pointing out some direc¬ 
tions for future research. 

Keywords: Bayesian; global-local shrinkage; horseshoe; horseshoe+; normal means; sparsity. 


1 Introduction 

Ultra-sparse signal detection provides a challenge for developing statistical estimators. In the 
classical normal means inference problem, we observe data from the probability model (y,|0 ; ) ~ 
A/”(0;, 1) for i = 1,..., n. We wish to provide an estimator for the vector of normal means 0 = 
(01, ...,6 n ). Sparsity occurs when a large portion of the parameter vector contains zeros. The 
"ultra-sparse" or "nearly black" vector case occurs when the parameter vector 0 lies in the set 
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lo[pn\ = : #(9i 7^ 0) < p n } with the number of non-zero parameter values p„ = o(n ) where 

p n —> oo as n —> oo. 

To motivate the need for developing new prior distributions, consider the classic James-Stein 
"global" shrinkage rule, 6j$(y)- This estimator uniformly dominates the traditional sample mean 
estimator, 9. For all values of the true parameter 9 and for n > 2, we have the classical mean 
squared error (MSE) risk bound: 


R(6js,6): =Ey\ g \\9js(y) - 9 1| 2 < n = E y |Ji/ - 0\\ 2 , V0. 


However, for a sparse signal, 9js(y) performs poorly Suppose that the true parameter 6 is an 
"r-spike" with r coordinates of magnitude \Jn /r and the rest set at zero, giving ||0|| 2 = n. Then 
Johnstone and Silverman (2004) showed that the classical risk satisfies R ( 9js,9 ) > n/2 whereas 
simple thresholding at %J2 log n performs with risk ^/log n. 

To address this issue, a "global-local" shrinkage estimator called the horseshoe estimator was 
proposed by Carvalho et al. (2010). The horseshoe estimator, 0 H g(i/), provides a Bayes rule that 
inherits good MSE properties of global shrinkage estimators and simultaneously provides asymp¬ 
totic minimax risk for estimating sparse signals. For example. Poison and Scott ( 2012j ) showed that 
9hs (}/) uniformly dominates the traditional sample mean estimator in terms of MSE and van der 
Pas et al. (2014 ) showed that the horseshoe estimator has good posterior concentration properties. 
Specifically, the horseshoe estimator achieves 


sup E y | e ||0Hs(y) — 9\\ : 
delolpn] 


Pn log (ft/ Pn) , 


which is the asymptotically minimax risk rate in ii for nearly black objects (Donoho et al.| jl992J . 
Here the "worst" 9 S lo[_Pn ] is obtained at the maximum absolute difference \0 HS (y) — y | where 
§Hs{y) = Ehs(0|j/J can be interpreted as a Bayes posterior mean which is optimal under the Bayes 
MSE. 

Though the horseshoe prior was originally designed to provide an accurate and efficient esti¬ 
mator of a sparse normal mean vector, it turns out that the multiple testing rule induced by the 
horseshoe prior also enjoys the "oracle property" in testing under the 0-1 loss (Datta and Ghosh 


2013). For the multiple testing problem in the classical two-groups model, many approaches in 
volve explicitly modeling the ultra-sparse mean as a mixture of a point mass at zero and a heavy¬ 
tailed alternative, also known as the "spike-and-slab" approach (Mitchell and Beauchamp, 1988). 
This results in a posterior distribution over a high-dimensional discrete space, exploring which 
often leads to extreme computational cost. The one-group model, inspired by the widespread 
popularity of the lasso for variable selection in regression (Tibshirani, 1996[ >, is computation¬ 
ally more tractable, and can be used to select a model through concentration of measure in a 
space of pseudo-probabilities, rather than in the (/-dimensional Euclidean space (Carvalho et al. 


2010 Datta and Ghosh 20 13[ [Poison and Scott 2010[ ). In particular, the horseshoe prior leads 
to "pseudo-posterior" probabilities that mimic the true posterior inclusion probabilities from a 
two-groups mixture model, and induces a multiple testing rule with attractive properties. Specif¬ 
ically, Datta and Ghosh (2013) proved that the Bayes risk for the horseshoe estimator attains the 
Bayes risk of the oracle if the global shrinkage parameter is of the same order as the proportion 
of sparisty using the asymptotic framework introduced by Bogdan et al. (2011). Thus, it seems 
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natural to require that any new sparse signal recovery prior should attain the oracle risk up to a 
multiplicative constant, and improve upon the error rates in theory as well as in practice. The gen¬ 
erality of the Bayes risk results was conjectured by Datta and Ghosh (2013 ) and proved by Ghosh 
et al. (2013) in a recent unpublished manuscript. Ghosh et al. (2013) proved that asymptotic Bayes 
optimality holds true for a general class of shrinkage priors where the local shrinkage parameter 
follows a distribution with a slowly-varying component bounded away from 0 and oo. This class 
of shrinkage priors includes many of the recently introduced priors such as the horseshoe, the 
normal-exponential-gamma (Griffin and Brown 2010|), the three-parameter beta (Armagan et al. 


2Q11| >, and the generalized double Pareto (Armagan et al.j 2013| ), among others, but this class ex¬ 
cludes the horseshoe+ prior, since its heavier tail is slowly varying but is not bounded above. 

In the light of the previous works, the purpose of our article, then, is to provide an estimator 
that sharpens the ability of the Bayes estimator to extract signals from sparsity while maintain¬ 
ing the optimal properties of the induced decision rule. We provide theoretical justifications by 
demonstrating that the proposed estimator has sharper information theoretic bounds and better 
MSE bounds compared to the horseshoe estimator. We illustrate that the horseshoe+ estimator 
achieves greater separation of signals and noise in a standard simulation setting and we provide 
a comprehensive MSE comparison with existing sparse estimators. We develop a hierarchical 
model which is a natural extension of the horseshoe model of Carvalho et al. (2010) and hence our 
terminology for the horseshoe+ hierarchical model. 

The rest of the paper is outlined as follows. Section [2] motivates the class of one-group global- 
local mixture shrinkage priors for sparse signal estimation as a suitable alternative to the com¬ 
monly used two-groups models. Section [3] describes the horseshoe+ estimator with a particular 
reference to global-local shrinkage estimators. Section [4] provides theoretical properties of our 
proposed estimator. Our major findings can be summarized as follows: 

1. The decision rule induced by the horseshoe+ prior attains the risk of Bayes oracle under 0-1 
loss up to a multiplicative constant, with the constant in Bayes risk close to the constant in 
oracle. We also obtain a sharper bound on the probability of type-I error compared to the 
horseshoe prior. 

2. The posterior mean squared error for the horseshoe+ estimator is always smaller than the 
posterior mean squared error of the horseshoe estimator in estimating a large signal. 

3. The estimated sampling density using the horseshoe+ prior converges to the true density 
at a super-efficient rate when the true parameter value is zero, when the efficiency is calcu¬ 
lated using the Kullback-Leibler (K-L) distance between the true density and the estimated 
sampling density. The upper bound of the risk for horseshoe+ is shown to be smaller than 
that of the horseshoe estimator using asymptotic properties of the prior utilizing Meijer-G 
functions ( jMathai et al. 2009). 

Section[5]provides comparisons of our proposed approach with other shrinkage rules using a stan¬ 
dard design setting. We compare horseshoe+ with the Dirichlet-Laplace estimator (Bhattacharya 
et al. 2014) and the horseshoe estimator ( jCarvalho et al. 2010| , illustrating superior performance 
of the horseshoe+ estimator in both estimation (under squared error loss) and testing (under 0-1 
loss). Section [6] discusses the application of the proposed prior on a high-dimensional prostate 
cancer data set. Section[7]concludes with some directions for future research. 
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2 The one and two groups models 


Consider the model of Section 1, i.e., (y ; |0,) ~ J\f(9j, 1), for i = 1,...,n, where 9 is ultra-sparse or 
nearly-black, in the sense that 0 £ Iq [p n ]. Our interest might lie in testing whether each 6, is zero or 
non-zero, based on a suitably normalized test statistic or in proposing a suitable estimate 6j, that 
has attractive properties, e.g., low mean squared error. The large number of parameters together 
with sparsity require further modeling of the data to facilitate learning via empirical Bayes or 
full Bayes methods. The two-groups model (see, e.g., Efron[ 2008 Mitchell and Beauchamp [T988 ), 
provides a natural Bayesian hierarchical framework for the sparse multiple testing problem where 
conditionally i.i.d. 6j are modeled as 


Bill* = (1 - + /CV(0, i/’ 2 ). 


( 2 . 1 ) 


where <5j 0 } denotes a point mass at zero and the parameter ip 2 > 0 is the non-centrality parameter 
that determines the separation between the two groups. Under this setting, the marginal distribu¬ 
tion of x/i | ]A is given by 

1 Mf ~ (1 - hW(0, 1) + /CV"(0,1 + i/> 2 ). (2.2) 


As can be seen from Equation \2.2) , the two-groups model leads to a sparse estimate, i.e., it puts 
exact zeros in the model. The two-groups model enjoys a number of attractive theoretical proper¬ 
ties, detailed as follows: 


1. Johnstone and Silverman (2004) showed that a thresholding-based estimator for 6 under the 
two-groups model with an empirical Bayes estimate for p is minimax in £2 sense. 


2. Castillo and van der Vaart (2012) treated a full Bayes version of the problem and again found 
an estimate that is minimax in i 2 . 


3. Bogdan et al. (2011) found that the estimator under the two-groups model provides asymp¬ 
totically optimal performance in testing, in the sense that its performance matches the Bayes 
oracle up to a constant. 

Thus, while the two-groups approach is a recognized gold-standard for Bayesian sparse signal 
detection and estimation, a number of arguments favor an alternative approach via the global- 
local shrinkage priors, also termed as the one-group model. First, in many real life applications, 
such as studies involving "high-dimensional, low sample size" gene expression data, the majority 
of the effect sizes are negligible, but not exactly zero, leading to an argument against exact sparsity 
induced by the model in Equations ( |2.1||2.2| l. From a more pragmatic point of view, the one-group 
model leads to much faster computation, owing to the simple batch updating in the Gibbs sampler 
for the latent local shrinkage parameters. We refer the readers to Carvalho et al. ( [2010 1 for further 
arguments and insights. 

A useful outcome of the two-groups model is that the posterior mean E(0 ! |y ! ) can be written 
as follows: 


t 


E( 0 i\Vi) = ~ w f y f ( 1 + o(l)) as rp 


00, 


(2.3) 


where ay = P(9, 7 ^ 0|y,) is the posterior inclusion probability. Looking at the form of the posterior 
mean, one can see that it involves a "global" component ip 2 / (1 + ip 2 ) that provides shrinkage 
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towards zero for all the parameters. However, the "local" component <x>{ allows the signal terms 
to escape from being too close to zero. The lack of a local shrinkage term explains why Stein-type 
global shrinkage estimators perform poorly in a nearly-black setting. 

The key to success in a one-group model is to design a "global-local" shrinkage term that gives 


theoretically attractive properties along with a considerably easier computational implementation 
compared to the two-groups model. 


the same form of the posterior mean as in the two-groups model. The horseshoe prior of Carvalho 


et al. (2010) is one such global-local shrinkage prior that has been shown to possess a number of 


1. Carvalho et al. (20101 showed the horseshoe estimator has good information theoretic prop¬ 
erties when the true parameter vector is sparse, in the sense that the K-L distance between 
the estimated and the true densities decreases at a super-efficient rate. 


2. Datta and Ghosh (2013) proved that the decision rule induced by the horseshoe estimator 
is asymptotically Bayes optimal for multiple testing under 0-1 loss up to a multiplicative 
constant. 


showed the horseshoe estimator is minimax in £2 in a nearly-black 
case up to a constant. The constant they have been able to achieve is at least twice as large 
as the minimax constant of Donoho et al. (1992). 


3. van der Pas et al. (2014 


These theoretical properties, coupled with the ease of computational implementation suggests the 
one-group model holds considerable promise. Some other important examples of the one group 
model include the three-parameter beta prior (Armagan et al., 2011), the normal-exponential- 
gamma prior (Griffin and Brown, 2010), the generalized double Pareto prior (|Armagan et al. 


2013p, the generalized shrinkage prior (Denison and George 2012) and the Dirichlet-Laplace prior 


(Bhattacharya et al,, 2014). Below we describe the one-group horseshoe hierarchical model and 
then proceed to propose the horseshoe+ model that leads to considerable improvements upon the 
horseshoe. 


3 The horseshoed- estimator 

Given normally distributed data (y ; |0,) ~ 1), the horseshoe hierarchical model is defined by 

the set of conditional distributions 


(0f|A;,T) ~ M (0,A f) , 

(A;|r) ~ C + (0, t) , 

where C + denotes a half-Cauchy distributed scale parameter A, with density 

2 


(3.1) 


P(Af|r) = 


7TT{1 + (A,-/t) 2 }' 


(3.2) 


as discussed by Gelman ( |2QQ6| >. The horseshoe+ hierarchical model is defined similarly by the set 
of conditionals 


(0i|Aj, t]i,r) ~ A f (0,A?), 


(3.3) 
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(A,-|j7f/T) ~ C + (0,TJ/ f ), 

7f~C+ (0,1), 


where we have introduced a further half-Cauchy mixing variable rjj. In both models, the local 
shrinkage random effects A,'s are not marginally independent after mixing over the global shrink¬ 
age parameter r. The horseshoe+ model builds on the horseshoe by assuming that the A/s are 
conditionally independent given another level of local shrinkage parameters rjj s, in addition to r. 
Integrating over gives the the density of A; as 


P(A,-|t) = 


4 log (A;/ t) 
7T 2 T (A,/t ) 2 — 1 ’ 


(3.4) 


Although conceptually a natural extension, we will see that the additional log (A,/ r) term in the 
numerator leads to very different properties of the proposed estimator compared to the horseshoe. 
There are a number of ways of dealing with the global shrinkage parameter r. In a full Bayesian 
approach one can put a standard half-Cauchy prior or a Uniform (0,1) prior on r. Another ap¬ 
proach is to appeal to an asymptotic argument that suggests that the empirical Bayes estimator of 
t to be set to T = p n /n, where p n is the number of non-zero entries in 9 (van der Pas et al.. 2014). 

To further develop the distributional properties of the horseshoe+ prior we write this as a 
member of the class of global-local shrinkage priors with marginal prior density 

poo 

p(0f|r) = J p(0i|Af,T)p(Ai|T)dA f . 

Transforming to a shrinkage scale with k; = 1/(1 + A?r 2 ) yields 

r\ f 7C’ 

p(0;|r) = / p(6i\Ki,T)p(Ki\r)dKi, with p(0 1 -|k i -,t) ~ J\f ( 0 ,-- 

J 0 V K i 

where Kj S [0,1] is a shrinkage "weight". The corresponding "ultra-sparse" Bayes estimator is 


§i = E(0 t -|y f ,T) = (1 - E(Ki|yi,r))y/, 


(3.5) 


where we need to compute E (jc; j/„ r). By comparing the expression for the posterior mean for 0,- 
for the one-group model given by Equation ( |3.5| ) to the two-groups model given by Equation ( |2.3| , 
it is apparent that the quantity d>; = 1 — E(jc;|y;, r) behaves as the posterior inclusion probability 
P(6j 7 ^ 01 \h ). This results in a natural threshold for simultaneously testing H 0; : 6, = 0 vs. Hu : 
6i 0 for i = 1 We will consider the following multiple testing procedure proposed by 
Carvalho et al. ( |2010| ), and later shown to be optimal under 0-1 loss by Datta and Ghosh ( |2013 1, for 
the horseshoe prior: 

Reject H oi : if 1 - E^jy,-, r) > (3.6) 


3.1 Shrinkage profile 

Note that the marginal data likelihood is p(y;|jc,-,r) = K- 1 exp ( —k,i/ 2 / 2 ) . Signals are identified 
when Kj —> 0 and sparsity occurs when Kj —> 1 in the posterior. We see that there are no shrinkage 


6 
















20 



0.00 0.25 0.50 0.75 1.00 

Kj 


Figure 1: The horseshoe+ (top) and horseshoe (bottom) prior Jacobian terms against Kj for r = 0.5 
and 2. The vertical lines are at k = 1/(1 + r 2 ). 


factors in the marginal likelihood to "help" identify signals in the normal model as p(yj\Kj, r) —> 
0 as Kj —> 0. This is precisely why the normal prior performs poorly for sparse settings. The 
horseshoe prior was designed to cancel the factor kJ ,/2 and to simultaneously places prior mass at 
Kj = 1 to introduce shrinkage (see Carvalho et al. ( j2Q10| > for further discussion). The priors on the 
local shrinkage factor A,- and the induced prior on Kj for the horseshoe, the horseshoe+ and the 
generalized double Pareto prior are summarized in Table [Tj 


Table 1: Priors for A,- and Kj for some global-local shrinkage rules. 


Prior for 0, 


Prior for A; 


Prior for k, 


GDP 


VI 

(A?) 


( ! 2u 7/| t rr,A u i 

V^exp{5p2^}E rfc{y^} ^ 

l y a? UjV liau 2(i7 Ki ) 2 

y/2Ki(l-Ki) 


Horseshoe 2/ + (A,/t) 2 )} 

Horseshoe+ 41og(A,/T)/ {n 2 r((A.j/r) 2 — 1)} 


T _1_ 

l/K;(1-7C;) (1 +K,'(t 2 -1)) 

T lQg{(l-K,)/KjT 2 } 

a/kHI-K;) (1-K;(t 2 + 1)) 


The main difference between horseshoe+ and the others is in the extra Jacobian term intro¬ 
duced in the representation on the shrinkage scale. This term has a fundamentally different be¬ 
havior for separating signals (k, = 0) from the noise terms (k, = 1). The horseshoe+ prior in¬ 
troduces another horseshoe U-shaped Jacobian factor that pushes posterior mass to the places of 
most interest, Kj = 0,1. This provides horseshoe+ prior with an additional power to detect signals 
in the ultra sparse signal case. Figure [ljplots the Jacobians of the horseshoe and horseshoe+ priors 
with t set to 0.5 and 2 to make the difference explicit. The horseshoe Jacobian displays unequal 
shrinkage behavior near the two extremities of Kj. 
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This extra shrinkage through the Jacobian is an unique property of the horseshoe+ prior not 
shared by any of the other univariate shrinkage priors. To see this, note that the 'sensible' priors 
can be expressed in terms of a slowly varying function following Theorem 1 of 
( 2010 ): 


Poison and Scott 


p(A?) oc (A2)(- fl “ 1 )L(Af) for r 2 = 1 (3.7) 

p(K f ) oc (1 - K f ) ( ~ fl " 1) K{ fl ~ 1) L(l/jc i - 1) (3.8) 


where L(-) is a slowly varying function with the property L(ty) / L(y) —> 1 as y —> oo. In a re¬ 
cent unpublished manuscript, Ghosh et al. (2013) showed that the popular shrinkage priors like 
the three-parameter beta (TPB, Armagan et al. ( |2011J ), which includes the popular Strawderman- 
Berger prior, the horseshoe prior and the normal-exponential-gamma prior, as well as the gen¬ 
eralized double Pareto (GDP, Armagan et al.| (2011|) prior fall into this class. Furthermore, the 
authors proved that the slowly varying component of ( 3.7 ) is bounded as A 2 —> oo for these popu¬ 
lar shrinkage rules, i.e. lim A 2 >co L(A 2 ) E (0, oo) for priors such as TPB and GDP. This is where the 
horseshoe+ prior stands out from the rest, as the slowly-varying component for the prior density 
Phs+{ A 2 ) is unbounded as A —» oo, i.e. 


lim L HS +(A 2 ) = lim log(A 2 ) 1 - 

A?->oo A ?—>oo \ A- 


OO. 


Since K t —> 0 as A 2 —> oo, the unboundedness of Lhs+(A 2 ) = L(1/k,- — 1) together with ( |3.8J 
implies that the extra shrinkage at lim Ki > o p (Kj ) — > oo only holds for the Horseshoe+ prior among 
all shrinkage priors expressible as heavy-tailed Gaussian scale mixtures. The Jacobian term can 
also be interpreted on the shrinkage scale. Specifically, for k = 1/(1 + r 2 ), we have 

^ .^K,y)°<n 7 T h I exp {- t ,|} |1 ° 8 ( (1 AAA " 


This representation shows that the horsehsoe+ prior allows differential shrinkage for k, around 
k (and is continuous at k ; = jc), and suggests that the global shrinkage parameter r 2 can also be 
interpreted as a scaling factor for the shrinkage weights k,. 


4 Theoretical properties of the horseshoe+ estimator 

In this section we establish a few theoretical properties for the proposed prior and the resulting 
posterior, from both a decision theoretic and information theoretic viewpoint. We present our 
main results in the form of seven theorems. Proofs and technical details are given in Appendix 
lAdHATl 

4.1 Marginal density for the horseshoe+ prior 

We start by formally establishing that the marginal prior density for horseshoe+ is unbounded at 
the origin. 
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Figure 2: Marginal prior densities near the origin. The legends denote the horseshoe+ (HS+), 
horseshoe (HS), Dirichlet-Laplace (DL), generalized double Pareto (GDP), Cauchy and Laplace 
priors. 



Figure 3: Marginal prior densities in the tail regions. The legends denote the horseshoe+ (HS+), 
horseshoe (HS), Dirichlet-Laplace (DL), generalized double Pareto (GDP), Cauchy and Laplace 
priors. 


THEOREM 4.1. Assume t 2 = 1. Then the marginal density of the horseshoe+ prior, p^s- i-(0), satisfies 
the following properties: 


1. 


TO 


V2 


n 


log 



< Phs+{0) < 


1 

7T 2 1 6 


9 





















2 . 


ta PHS+ («) = =o. 


A proof is given in Appendix |A.1| Figures |2] and [3] show the behavior of several global-local 
shrinkage priors near the origin and at the tails. The priors considered here are: horseshoe+, 
horseshoe ( |Carvalho et al. 20101, Dirichlet-Laplace (Bhattacharya et al. 2014), generalized double 
Pareto (Armagan et al., 2013), standard Cauchy, and standard Laplace (double-exponential). Note 
that the horseshoe+ and horseshoe densities are unbounded near the origin and more importantly, 
horseshoe+ puts more mass compared to the horseshoe in a small neighborhood of the origin and 
has heavier tails. Carvalho et al. (2010) established that a prior with unbounded density near the 
origin leads to super-efficiency in density estimation in a sparse signal setting. Due to Theorem |4.1| 
the horseshoe+ estimator enjoys the resultant advantages as we shall show in Section 14.31 


4.2 Asymptotic Bayes optimality under sparsity 


Datta and Ghosh (2013) proved that the Bayes risk optimality for the horseshoe prior leverages the 
fact that the shrinkage weight 1 — k, concentrates near one (uniformly in y,) if the global shrinkage 
parameter r —> 0, and concentrates near zero if |j/,| —> oo for any fixed r in (0,1). To attain the 
Bayes risk of the oracle, one additionally needs the global shrinkage parameter r to adapt to the 
underlying proportion of non-zero effects//„, i.e. ]im„_ >00 ry~ 1 £ (0, oo ), where u n =#{0/ ^ 0 } / n. 
It turns out that similar concentration inequalities, but with sharper bounds, hold for the poste¬ 
rior distribution of jq under the new horseshoe+ prior. At an intuitive level, this suggests that 
the decision rule induced by the horseshoe+ prior will also inherit the same, if not better, opti¬ 
mality properties. In this section, we state two posterior concentration inequalities along with the 
asymptotic type-I and type-II error probabilities to establish the oracle property for horseshoe+. 

Below, we briefly describe the notion of Bayes oracle in the context of multiple testing follow¬ 
ing the asymptotic framework of Bogdan et al. (2011). Assume the two-groups model of Equations 

9j = 0 vs. Hu : 0, 0 is 


(2.1 Z2) . The optimal Bayes rule under a 0-1 additive loss for testing Hq; 
given by: 


Reject Hoj if | y z -1 > C 


where. 


C 2 = C 2 tf = (log {xp 2 + 1) + 2log/) where/ = 

Y r 


(4.1) 


We call this rule the Bayes oracle as the risk for this is the lower bound of (1/n) times the risk 
for any multiple testing procedure under the two-groups model. Bogdan et al.|( 201l[ ) further re¬ 
parametrized this by u = if) 2 and v = uf 2 , to obtain the following simpler form for the threshold 
in the oracle: 

C 2 = (1 + ^)(logu + log(l + h). (4.2) 

It VI 

For maintaining clarity of notations and preserving correspondence with the original work of 
Bogdan et al. (2011|), we use the same asymptotic framework in form of the following assumption: 


10 












































ASSUMPTION 1. The sequence of vectors 7 „ = (ip n , p n ) satisfies the following conditions: 


Y-n -A 0; M„ = t p 2 ^ 00 ; V n = U n fl = l/> 2 


log^„ 

ll n 


C E (0, 00 ) AS H —t 00 . 


/1 

V Yn 


2 

—> 00 ; 


REMARK 4.1. The asymptotic framework provides a natural way to study the properties of the Bayes risk 
as the parameter vector 7 = (tp,y) defining the Bayes oracle in Equation B varies through an infinite 
sequence indexed by the number of tests n increasing to infinity. To reduce notational complexity, we will 
suppress the index n from j n , Pn, T„, fn throughout the remainder of this section. The statements such as 
y —> 0 should imply that p n —>■ 0 as n —> 00 . 


REMARK 4.2. It shoidd be pointed out that the conditions are not restrictive, and are in fact minimal 
conditions for optimality in some sense. On one hand, the Bayes Oracle is no better than a coin toss if the 
limit ip 2 /2\og(l/y) —» 00 , making the test powerless, and has zero type-II error when the limit goes to 
zero, which could happen if one has an infinite number of replicates. The interesting cases are obtained for a 
finite, non-zero limit which Bogdan et al. (2011) term as "verge of detectability" and our results pertain to 
this situation. 


Under Assumption [jj the type-I and type-II error probabilities of the Bayes oracle are given by 
(Bogdan et al. (2011)): 


,BO 

h 


= e 


-C/2 


7TV log V 


(1 + On), 


fP° = (20{y/C)-l)(l + On), 

R op t = n ((1 - p)tf° + = np(2<P(VC) - 1)(1 + o n ), 


(4.3) 


where o n denotes an infinite sequence of terms, indexed by n (the number of tests), converging to 
zero as n —> 00 . The last expression follows from the fact that the Bayes risk for a fixed-threshold 
multiple testing rule is given by R = n ((1 — p)t] + ^ 2 ) for an additive 0 — 1 loss, when t\,t 2 
denote the type-I and type-II error probabilities respectively. A decision rule is said to attain the 
asymptotic Bayes optimality under sparsity (or, ABOS) if the ratio of the Bayes risk of the decision 
rule to the risk of the Bayes oracle (Equation |4.3| goes to 1 as multiplicity n — > 00 . Now, we present 
the first concentration inequality on the posterior distribution of k, providing the conditions under 
which the posterior mass of ;c, concentrates near one. We show that an upper bound to the he 
posterior mass of k, e (0 , e), decays as r 2 . 


THEOREM 4.2. Suppose zve have observations y\,... ,y n where yi ~ N(6 u 1), for i = 1 ,...,n, and 
the prior on 6 t is distributed as horseshoe+ with the hierarchical model given by ( |3.3| >. Then the posterior 
distribution ofKj = (1 + A 2 t 2 ) -1 given y\ and r satisfies the following: 


P(Kf < e\i /j, r) < e 2 r 2 e(l - e) 2 , 


(4.4) 


for any fixed e G (0,1), and any r G (0,1). 
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The proof is given in Appendix |A.2| Theorem 4.2 implies that the posterior distribution of Kj 
given t and the observation y,- would converge to a point mass one if r —> 0. This leads to the 
following bound on the probability of type-I error rate for horseshoe+ prior, with proof given in 
Appendix |A.3| 


THEOREM 4.3. Suppose we have observations y\,... ,y n from the 'two-groups' model in Equation \2.2) , 
and we want to test Hq, : 0; = 0 us. Hu : 9j f 0, using the decision ride of Equation ( |3.6| > induced by the 
horseshoe+ prior. Suppose furthermore that Assumption^holds for the parameter vector then the 

probability of type-I error for horseshoe+ decision rule is given by: 


h < 


71 ^/log^^T 


= ( 1 + 0 ( 1 ))- 


REMARK 4.3. It should be noted that one of the bounds (and the type-I error rate) obtained for the horse¬ 


shoe+ prior are sharper than that obtained for the horseshoe prior. Theorem 4.2 shores THs+( K i < e| y;, r) = 
0(r 2 ) whereas Datta and Ghosh ( 2013 obtained P ns( K i < £ \l G t) = O(r). This relative gain will not 


affect the asymptotic order of the total Bayes risk derived here, but this result has interesting implications 
(e.g. lower false positives) nonetheless. 


We now present the second concentration inequality in the other direction, with a proof in 
Appendix |A.4| 

THEOREM 4.4. Suppose we have observations y\,... ,y n where yi ~ Midi, 1), for i = 1, ..., n, and 
the prior on 6j is distributed as horseshoe+ with the hierarchical model given by Equation ( |3.3| ). Then the 
posterior distribution ofK{ = (1 + A^r 2 )" 1 given yi and r satisfies the following: 


P(k,- > ?/|j/f,r) < e 




cm. 


(4.5) 


for any fixed // G (0,1), any fixed 5 G (0, l/f/(l 
constant independent ofyu 


r 2 )) and uniformly in y, G K, where C(q,5) is a 


A corollary of Theorem |4.4| is that the posterior distribution of Kj given r and y, would converge 
to a point mass at zero if | t / z -1 —> oo. 

A crucial step for proving the optimality for the horseshoe prior is the choice of the global 
shrinkage parameter r. Datta and Ghosh; ('2013) chose r to be of the same order as the proportion 
of signals p, i.e. r = r„ = 0(g n ). They also argued that the optimality of the decision rule induced 
by the horseshoe prior depends on how well the sparsity is captured in the hyper-parameter r. 
This was further supported by van der Pas et al. (2014) who showed that the condition r = 0(g) 
is a sufficient condition for the minimaxity properties of the horseshoe estimator. Since the role of 
t as a global scale parameter for the prior on local shrinkage parameters A, does not change with 
the horseshoe+ prior, intuitively the same choice on r would lead to the optimal type-II error rates. 
Under this choice of r, it follows that the type-II error for horseshoe+ decision rule has the same 
asymptotic order as that of the type-II error rate for the Bayes oracle. Let C denote the constant in 


the expression for the risk of the Bayes oracle as appears in Equation (4.2). Then it follows from 
Theorem |4.4|that the type-II error rate has the following upper bound: 
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THEOREM 4.5. Suppose we have observations y\,... ,y n from the ‘two-groups' model in Equation \2.2) , 
and wish to test H 0l : 0, = 0 vs. Hy : 6j 0, using the decision rule of Equation ( |3.6| . Suppose 
furthermore that Assumption^holds for the parameter vector ( ip , p), and the global shrinkage parameter r 
decreases to zero such that r = O(p). Then for all q G (0,1) and d G (0, 1 / 77(1 + r 2 )), the probability of 
type-II error of the decision rides induced by the horseshoe+ prior is bounded above by: 


*2 < 2<£(i 


q(l-5) 


VC)-1 (l+o(l)). 


The proof is given in Appendix |A.5| The proof of this theorem follows similar steps as the 
proof of type-II error rate for horseshoe prior in Datta and Ghosh (2013), where a fixed // = 1/4 
and S = 1/9 were used for deriving an explicit expression. Then it follows from Theorems |4.3| and 
|4.5|that the risk of the horseshoe+ decision rule is given by 


R «* -"Wfe- ’> + 


(i + o(i)) 


= n<p(20L 


q(l-5) 


VC)-1)1 (l+o(l)) as t —y 0. 


Since the risk of the Bayes oracle is RgQ = n |/ i (2^’(v / C) — 1)| (1 + o(l)), it follows that the 
horseshoe+ decision rule attains the Bayes oracle up to a multiplicative constant. 


4.3 Kullback-Leibler risk bounds 


Carvalho et al7| ( |2010j ) proved that for horseshoe the Bayes estimate for the sampling density, mea¬ 
sured using the Kullback-Leibler distance between the true model and the estimator of the density 
function, converges to the truth at a super-efficient rate. Let 9$ be the true parameter value and 
f(y\0) be the sampling model. Further, let k ( 7 / 1 , 772 ) = IF, ; logit]] /q 2 ) denote the K-L divergence 
of a density 7/7 from q \. The proof utilizes the following result by Clarke and Barron ( 199Q| ). 


PROPOSITION 4.1. (Clarke and Barron, 1990). Let v n (dO\yi,... ,y n ) be the posterior distribution 
corresponding to some prior v(d6) after observing data y^ = (py\,... ,y n ) according to the sampling 
model f(y\6). Define the posterior predictive density q n (y) = f f(y\9)v n (d9\yi,... ,y n ). Assume fur¬ 
ther that v(A e ) > 0 for all e > 0. Then the Cesaro-average risk of the Bayes estimator, defined as 


Rn = n 1 k(q eo , +), satisfies 


R n <£ -logv(A e ), 


where v(A e ) denotes the measure of the set {9 : K(qe 0 , qe ) < e}- 


Using the above proposition. Theorem 4 of Carvalho et al. (2010) proves that for the horseshoe 
estimator the Cesaro-average risk satisfies 


R “ = °ls log ((tai^ 


(4.6) 
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when the true parameter 0 q = 0. This rate is faster than any prior without a pole at zero. It is super¬ 
efficient, in the sense that the risk is lower than that of the MLE, which has the rate 0(log n/n). 
The same result holds for the horseshoe+ estimator due to its infinite mass near zero (by Theorem 
|4.1[ ). However, we demonstrate that the horseshoe+ prior in fact has a better rate of convergence 
than the horseshoe prior. Our result is based on the following theorem. 

THEOREM 4.6. Let p° HS+ {6) and p° HS {0) denote the marginal densities of the horseshoe+ and horseshoe 
priors at the origin. Then we have 


L X ^ e)ie = 7Ttkr„( 


log 2 (n) 


1 - 


7 . log (4) 


log(u) + 0 ( 1 ) 


where 7 is the Euler-Mascheroni constant and 


L PHS ^ dd V 2 


/log(n) 


0 ( 1 ) 


The proof is given in Appendix A .6 Due to the extra log (ft) factor, the horseshoe+ prior places 
more mass around a neighborhood of the origin compared to the horseshoe prior. Thus, when 
00 = 0, setting e = 1/n gives after some algebra from Proposition|4.1|that 


Rn{HS ) < 


log n 
In 


1 

n 


log log n 


const. 


and 


„ /TT „ \ log n 1 2 log log n 

R n (HS+) < — -I-- —— + const. 

2 n n n 

Therefore, the multiplier of the log log n term improves for horseshoe+. 


4.4 Mean squared error 

It is well known that if p(|y,- — 0;|) is the standard normal density and p (0,) is a zero mean scale 
mixture of normals, with the scale parameter A 2 following a proper prior law, the posterior mo¬ 
ments of 9j admits the following representations, also known as "Tweedie's formula" ([Efron 2011): 


E(0«|yi) =yi + Wi logm( yi ), 


V(%) = 1 


d 2 

d V 2 i 


2 log m(yi), 


(4.7) 


(4.8) 


where m(y,) is the marginal for i/, (see for example Pericchi and Smith (|1992> and Carvalho et al 


(2010)). Furthermore, we can use properties of slowly varying functions to show that if the prior on 
9{ can be written as a normal scale mixture with a "slowly-varying" prior on the scale parameter, 
the marginal inherits the slowly varying property. For priors with a polynomially heavy tail it 
can also be shown that the resulting posterior mean is asymptotically robust, in that the difference 
\E(Qi\iji, t) — \ji\ vanishes for large |y z -1 while r is fixed. 

Heavy-tailed distributions are often characterized by the notion of regular variation. The fol¬ 
lowing definition is due to Karamata (see Mikosch (|1999) or Bingham et al. (|1989|) for a detailed 
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discussion). 


DEFINITION 4.1. A positive, measurable function L is called regularly varying at infinity with index a 
if it is defined on the interval [xq, oo) for some xo and 

lim = t a for all t > 0 

X->+oo L{x) 


L(-) is called a slozvly varying function at infinity if a = 0. 


Using the above definition, we state the following result from Theorem 6.1 of Barndorff- 
Nielsen et al. (1982) . 


PROPOSITION 4.2. (Barndorff-Nielsen et al, 1982). Consider the Gaussian scale mixture y|A 2 ~ 
Normal(0,A 2 ) and suppose the prior density of A 2 is given by /(A 2 ) = (A 2 ) a- 1 L(A 2 ) as A 2 —> oo, 
where L(-) is a slowly varying function. Then the marginal m(y) after integrating out A 2 has the property 
thatm(y) oc \y\ 2oi ~ l L(y 2 ) as \y\ —> oo. 


Let mj-is+ (yd and m^s (yi) denote the marginals under the horseshoe+ and horseshoe priors re¬ 
spectively. Proposition |4.2| immediately shows that we have that nins+ (yd = m Hs(yd l°g(|]//1) (1 + 
o(l)) as \yi\ —> oo, since the only difference between the horseshoe and horseshoe+ mixing densi¬ 
ties is the additional slowly varying (log A/) term in the scale mixing density for the horseshoe+ 
prior. In particular, as |j/,| —» oo, we have 


m H s+(yd = mHs(yd * iog(|y,i) x 


fi~ i 


x constant. 


yf + 1 

where "constant" denotes the collection of all terms that does not involve y\. Thus, 

d , r \ d . , , 1 4 yj 

W. 1 ° 8 ""' s+(y ' ) = ¥ log ’" Hs(y ' ) + - yM' 


and. 


0(l/yf) 


d 2 . d 2 . 1 +logy, , o, 

dif log ’" HS+(y ' ) = lyi - (yilogyi)2 + o (1/y,). 


(4.9) 


(4.10) 


Using Equations ( |4.7[ ) and ( |4.8| >, in combination with Equations ( |4.9| and ( 4.10| , allows one to relate 
the bias and variance, and hence the MSE, for the horseshoe and the horseshoe+ estimators. We 
have the following result: 


THEOREM 4.7. Suppose p(\y, — 0,-|) is the standard normal density, and pi-is(@d anc l PHS+(@d 
note the horseshoe and horseshoe+ prior densities on 6j, leading to the posterior mean squared errors 
MSEhs(0z|J6') an d MSE H s+(0«|yi) respectively. Then,for large values of |y,|, we have, 


MSE HS +(0f|y/) =MSE HS (0,-|y,-) 


1 

y^°s\yi 


+ o 
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The proof is given in Appendix |A.7| This theorem establishes that the horseshoe+ estimator 
has asymptotically lower MSE compared to the horseshoe estimator when |y, | is large, due to the 
extra (log |y,-|) factor in the marginal, which in turn is due to the extra (log A,) term in the prior 
mixing density 


5 Numerical examples 

5.1 Sum of squared error about the posterior median 

We follow the simulation setting described in Bhattacharya et al. (2014). We simulate data y;|0; ~ 
Normal(0„ 1) for i = 1 ,... ,n, where (9, = A in fraction q of its components with the magnitude 
of A = 7,8 and 0, = 0 in the remaining components. We report simulation results for n = 200 in 
Table [2] Each configuration is replicated 100 times and the average sum of squared error about the 
posterior median is reported. 


Table 2: Average SSE about the posterior median for n = 200 for the competing priors. The 
averages are computed over 100 replicates. The lowest SSE for each setting (in rows) is in bold. 


q 

A 

D-L 

HS Cauchy 

HS+ Cauchy 

HS Unif 

HS+ Unif 

0.05 

7 

26.86 

15.95 

18.58 

17.11 

18.08 


8 

22.49 

14.47 

15.97 

15.26 

17.42 

0.1 

7 

43.76 

33.92 

31.65 

35.13 

33.51 


8 

43.81 

32.28 

29.77 

33.67 

32.23 

0.2 

7 

78.11 

69.29 

59.26 

83.61 

59.92 


8 

82.64 

70.72 

62.64 

118.52 

63.69 

0.3 

7 

103.46 

104.33 

86.77 

322.93 

100.26 


8 

121.04 

108.12 

93.21 

373.71 

220.16 


We compare the proposed horseshoe+ prior with two competitors: the horseshoe prior of Car¬ 


valho et al. (j2010[> and the Dirichlet-Laplace (D-L) prior of Bhattacharya et al. (|2014|>. To deal with 


the global shrinkage parameter r for the horseshoe and the horseshoe-t- priors, we try two scenar¬ 
ios: (a) r ~ C + (0,1/n) and (b) r ~ Unif[0,1]. For posterior sampling, we use the Stan software 
package (Stan Development Team 2014) to draw 10,000 samples in each case, half of which are 
treated as burn-in and discarded. We monitored MCMC convergence and found no evidence of 
mixing problems. The D-L prior is implemented in its hierarchal normal-exponential form, and 


the horseshoe and horseshoe+ priors by the hierarchical model in Equations (3.11 and ( |3.3) respec¬ 
tively. 

In Table [2j the estimator with the lowest average SSE is in bold in each simulation setting (in 
rows). The horseshoe+ prior with the half-Cauchy prior on r has the lowest SSE in all but two 
cases, in which the horseshoe prior performs the best. The C + (0,1/n) prior on r results in better 
performance over a Uniform(0,1) prior for both horseshoe and horseshoe+ since the former puts 
more mass in a neighborhood close to zero, helping r adapt to the sparsity level of the data. 

To make the difference between the horseshoe and the horseshoe+ estimates clear, we plot 
E(Kj|i/,) and IE(0,|i/,) for i = 1,... ,n, for horseshoe in Figure |4] and for horseshoe+ in Figure [5] In 
both cases, the prior on r is Unif[0,1]. We used n = 200 and simulated y, with 10 components 
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Figure 4: Estimated K, and 0, for horseshoe for n = 200 with first 10 true 0, equal to 7 and rest 
true values set to 0. Dots are posterior means and solid lines are the middle 95% posterior credible 
intervals. We used r ~ Unif [0,1]. 
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Figure 5: Estimated rq and 0, for horseshoe-t- for n = 200 with first 10 true 6j equal to 7 and rest 
true values set to 0. Dots are posterior means and solid lines are the middle 95% posterior credible 
intervals. We used r ~ Unif [0,1]. 


with a mean equal to 7 and the rest with mean 0. Without loss of generality, the components 
(true values and estimates) with true non-zero means are plotted as the first 10 data points and 
those with true zero means are plotted afterwards. The posterior means are shown as dots and the 
middle 95% posterior credible intervals by solid lines. By comparing the estimates, it is clear that 
horseshoe+ does a much better job compared to horseshoe in terms of shrinking the noise terms 
to zero (estimated iq closer to 1 or equivalently, estimated 0, closer to zero). 
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Figure 6: Misclassification probability plots for the horseshoe+, horsesshoe, and the Dirichlet- 
Laplace (DL 1 / n ) shrinkage priors, Benjamini-Hochberg and the Bayes oracle for fi G (0.1,0.5). 


5.2 Misclassification probabilities 


We compared the performance of the multiple testing rule induced by the horseshoe+ prior with 
two other global-local shrinkage priors: the horseshoe prior of Carvalho et al. (2010) and the 
Dirichlet-Laplace prior of Bhattacharya et al. ( 201 4 j) in terms of the misclassification probability 
(MP). We use the misclassification probability as a criteria for our experiment as it is equal to the 
Bayes risk under a 0-1 additive loss for data generated by a two-groups model. We follow the 
same experimental set up in Bogdan et al. (2008), replicated in Datta and Ghosh ( 2013| ), where the 
Bayes oracle (BO) acts as the lower bound and the MP = ft line as the upper bound, where ft is 
the proportion of signals. We simulated data of size n = 200, xp n — a/ 2 log n = 3.26. Our data 
generation scheme follows the conditions provided by Bogdan et al. (2011), which guarantees the 
optimality of the Benjamini-Hochberg procedure to use it as another practical lower bound along 
with the Bayes Oracle. 

Figure [6] shows the misclassification probabilities (henceforth abbreviated as MP) for different 
shrinkage priors considered for ten equispaced values of fi G [0.01,0.5] along with the oracle and 
the straight line (MP = fi). Figure [6] shows that the misclassification probability for the horseshoe+ 
prior is very close to that of the Bayes oracle for a wide range of values of }i, and departs a little for 
values higher than 0.2. Furthermore, the horseshoe+ decision rule leads to a superior performance 
compared both the horseshoe and the Dirichlet-Laplace prior. We have also plotted the MP for the 
Benjamini-Hochberg rule, for a. = 1/ log n = 0.1887, along with the one-group shrinkage priors. 
Under this setting, the Benjamini-Hochberg rule achieves the same MP as the oracle. This is in 
concordance with the theoretical results for optimality of BH in Bogdan et al. (2011). 

We used the full Bayes estimates for the hyperparameters for both the horseshoe prior and 
the double exponential prior. For estimating t, we assumed standard half-Cauchy prior on r for 
deriving the full conditionals using a Gibbs sampler. As pointed out by Carvalho et al. (2009) 
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and Scott and Berger ( 20061 , the fully Bayesian approach for estimating r has a few adavantages 
over its alternatives, viz. empirical Bayes and cross-validation. In the extremely sparse case, the 
empirical Bayes estimate of r might collapse to 0 ( |Bogdan et al.[ [2008} |Scott and Berger} |2010~) . 
Cross-validation, though free of this problem, uses plug-in estimates for the signal-to-noise ra¬ 
tio. Carvalho et al. |2009| ) argue that the plug-in estimates are not necessarily wrong, but caution 
should be exercised while using them for extremely sparse problems. 


6 Application on a prostate cancer data set 

We illustrate the performance of the horseshoe+ prior for the benchmark prostate cancer data, in¬ 
troduced by Singh et al. ( |2002| > and made popular by Efron (2008 2010a|b| , among others. The 
prostate cancer data has gene expression values for n = 6,033 genes for m = 102 subjects, with 
vi\ = 50 normal controls and m 2 = 52 prostate cancer patients. The goal is to identify "genes" 
that are differentially expressed between controls and the cancer patients. To analyze this data 
further, the test statistic values are calculated for each of the 6,033 genes by first calculating a 
two-sample f-statistic, say t u i = 1,2,..., n = 6,033 for each of the genes and then applying the 
inverse Normal CDF transformation to obtain 1 /,• = 0 1 (t/j j . The y,-values can be mod¬ 

eled as independent Gaussian variables with mean 0/s, i.e. \ji ~ 0, + e, to cast this problem as a 
high-dimensional normal means inference problem. The corresponding multiple testing problem 
would be to simultaneously test the hypotheses H () , : 0, = 0, for i = 1 ,n. Under the global 
null hypothesis of no 'differentially expressed' genes, one should expect the histogram of the test 
statistics to follow a A/”(0,1) density curve but the histogram shows a heavier tail, suggesting the 
presence of a few regulatory genes. 

For a proper appraisal of the extra shrinkage by the horseshoe+ prior at the tails compared 
to the horseshoe prior, we do the following experiment: We consider the top 10 genes selected 
by Efron (j2010a) and their effect sizes estimated by a two-groups normal hierarchical model. We 
apply both the horseshoe and the horseshoe+ prior to the 6,033 test statistics, and compare the 
'effect-size' estimates 0, for these genes. One would expect that the horseshoe+ prior would shrink 
these "top" genes even less than the horseshoe prior and as a result the posterior mean 0 , = 
(1 — E(k,|i/ ; , r))yi would be closer to the observed test statistics 1 

Table [3] shows the top 10 genes selected by Efron ( j 201 0a , and the effect size estimates by the 
horseshoe and the horseshoe+ priors. For both the horseshoe and horseshoe+ prior, we imple¬ 
mented a Gibbs sampler with 15,000 draws with a burn-in period of 3,000 draws. The benefits 
of a heavier tail become apparent from this table as in 9 out of the top 10 genes, the horseshoe+ 
estimates are closer to the observed test statistics compared to the horseshoe estimates. One might 
naturally wonder about the performance of the two competing Bayesian models for the "uninter¬ 
esting" genes, and it turns out that both the priors have equal strength in squelching the noisy 
test statistics to zero. Figure [7] shows the posterior mean for the two priors against the observed 
test statistics. It can be clearly seen that all the procedures show good shrinkage properties near 
zero, and the only difference comes from the performance near tails, or robustness to large signals. 
This is also reflected in the value of the estimated mean squared prediction error calculated as 
MSE = (1/n) YH= i(0f — 1 fi) 2 . The values of the mean squared prediction error for the horseshoe+ 
and the horseshoe prior are 1.189 and 1.045 illustrating the superiority of horseshoe+ prior over 
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Figure 7: Posterior mean E(0,|i/ ! ) against y, for 6,033 genes for the horseshoe and horseshoe+ 
priors applied to the prostate cancer example. 


Table 3: The test statistics (y-values) and the effect-size estimates for the top 10 genes selected by 


Gene 

y-value 

e« s + 

ef s 

^Efron 

610 

5.29 

5.20 

5.12 

4.11 

1720 

4.83 

4.77 

4.54 

3.65 

332 

4.47 

3.24 

4.11 

3.24 

364 

-4.42 

-4.43 

-4.14 

-3.57 

914 

4.40 

4.40 

3.89 

3.16 

3940 

-4.33 

-3.78 

-3.77 

-3.52 

4546 

-4.29 

-3.88 

-3.46 

-3.47 

1068 

4.25 

3.71 

3.03 

2.99 

579 

4.19 

3.99 

2.88 

2.92 

4331 

-4.14 

-3.48 

-3.26 

-3.30 


the horseshoe prior. 

7 Discussion 


Efron (2010a ) by the horseshoe, horseshoe+ models, and Efron's two-groups model estimates. 


We have provided a default Bayesian shrinkage estimator for extracting signals from a sparse 
parameter vector. The proposed prior is called horseshoe+ prior as it renders itself to a natural 
extension of the horseshoe prior and provides substantial improvement for the "nearly-black" 
or "ultra-sparse" situation. In particular, the heavier tails of the horseshoe+ prior leads to an 
increasing ability of separating the signals, and the larger mass near origin leads to better handling 
of sparsity and a higher order of super-efficiency for the risk in density estimation. We have 
examined this new prior both theoretically and empirically by considering the estimation accuracy 
for a high-dimensional parameter vector as well as the error rates for the multiple testing rule 
induced by applying a threshold rule to the pseudo posterior inclusion probabilities. 
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Our asymptotic results demonstrate that the horseshoe+ estimator achieves a lower MSE and 
the horseshoe+ decision rule attains the Bayes oracle in testing up to 0(1) with a sharper bound 
on the type-I error rate compared to horseshoe. While we have not discussed the asymptotic min- 
imaxity properties of the horseshoe+ estimator in the £2 sense, we conjecture that the asymptotic 
minimaxity will continue to hold, likely with a sharper bound on the constant term compared to 
van der Pas et al. (2014). The sharpening effect of the horseshoe+ prior can be attributed to the 
extra shrinkage gained by having a U-shaped Jacobian over a lopsided one, in addition to the 
U-shaped prior induced on jq. It is also worth noting that asymptotic minimaxity results for a 
class of priors with p(A?) o< (A?) _fl_1 L(A?) where L(-) is a bounded, slowly varying function has 


been established by Ghosh and Chakrabarti (2014). For horseshoe+, L(A,) = A? log A,/ (A? — 1) is 


slowly varying but not bounded. It is an interesting exercise to see if the framework of Ghosh and 
Chakrabarti ( |2014j ) can be adapted for horseshoe+. 

In the recent past, there have been a few shrinkage priors that we collectively call the 'global- 
local' shrinkage priors following Poison and Scott (2010). These priors include the generalized 
double Pareto ( Armagan et al.[ 2013) , the normal-exponential-gamma (Griffin and Brown [2010J , 
the three parameter beta (Armagan et al. 2011) , and the Dirichlet-Laplace (Bhattacharya et al. 


2014), among others. These priors exhibit similar shrinkage properties as the horseshoe prior in 
that they simultaneously squelch the noise to zero and recover the signals. Though these priors 
lead to competitive performances in the sparse signal recovery problem, they also have unique, 
distinguishable characteristics. For example, the generalized double Pareto prior leads to a closed 
form prior density of 9 and induces a sparsity favoring penalty in regularized least squares, while 
the Dirichlet-Laplace prior models the joint distribution of 9 under the two-groups model via the 
joint distribution of the shrinkage parameters. The behavior of the marginal prior densities on 
9 can be seen from Figures [2] and [3j and our simulation results suggest improvements for both 
estimation and testing, but it is an open question whether a set of necessary conditions can be 
imposed on the class of global-local shrinkage priors that guarantees certain desirable properties. 

A key insight we gain from the success of the family of the global-local shrinkage priors is 
that the global shrinkage parameter plays a vital role in controlling the behavior of the posterior. 
Specifically, the global shrinkage parameter in the horseshoe prior needs to be of the order of 
the proportion of non-null effects to ensure asymptotic minimaxity in estimation (van der Pas 


et al. 2014) as well as the optimality of the induced decision rule in testing (Datta and Ghosh 


2013). We have proved that the same condition also guarantees the optimal performance for the 
horseshoe+prior in testing. 

Finally, the horseshoe+ prior can be further extended by modeling the local shrinkage parame¬ 
ter A j as a higher order product of independent half-Cauchy random variables, leading to an even 
heavier tail and larger spike at zero. The moments and densities of the Cauchy product C 1 C 2 ... Q 


are given in Bourgade et al. (2007). The density ¥*(•) of the ^-product C 1 C 2 ... Cjt for the even and 


the odd cases are as follows: 

T 2 i+l(x) 


72/ 


7r(2z)! [f = 


n, (y -i,> + W))A_ 
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Furthermore, one might use the "universal prior" due to Rissanen (i983) over the number of terms 


k in the product density. The "universal prior" is defined with the mass function: 

Q(i) = 2 ~ L ° W , for i = 1,2,...; L°(z) = log* (z) + log c, 

where, log*(x) = log* + log log x + ..., where the sum involves only non-negative terms and 
c = m 2.865064. 

The family of Cauchy-product densities can be used in conjunction with Rissanen's universal 
prior described above to define an adaptive shrinkage estimator such as the Polyshrink estimator 
due to Foster and Stine (2005), where the amount of shrinkage varies adaptively with the esti¬ 
mation task. For an n-dimensional parameter 6, the Polyshrink estimator uses a collection of dis¬ 
crete mixture models Q p = jg eit (y) = (1 — e] c )<p(y) + e k \p(z), = 2 k ~( K+1 ' 1 j, fork = 1,... ,K with 
K = 1 + Llog 2 (p)j and (p(-) and ip(-) denote the standard normal density and Cauchy density with 
scale \/l respectively. We conjecture the advantages of the one-group model over the two-groups 
model would naturally carry over to this case if we use a collection of one-group priors defined 
by Cauchy products of different orders to achieve different amounts of shrinkage. The possibility 
of such extensions was first discussed in Poison and Scott ( |2012 1 , and it would be interesting to 
settle this issue theoretically. 


A Proofs of theorems 

A.l Proof of Theorem 14.II 

Let Phs+ (&) be the marginal density for the horseshoe+ prior. From Equations ( 3.3[|3.4 1 with r = 1, 
we have, 

... 4 1 1 _u_0zlog(A) ,, 

PHS+{6) = ^V^L A e 2A2 A^T rfA - 

First note that, applying the transformation £ = 1 / A 2 we get 


Phs+( 0 ) = ~ /— 


TV 


\/2n Jo 


>§* 4 pQ-a. 

C -1 


For an upper bound one can use the fact that 


f>0. 

C-l ~ 


This gives 


r r Ae* 1 og(0 

lo C~ 






e~i° d£ 


r(i/2) 


(0 2 /2)C2 |0| 
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For a lower bound one can use the fact that 


C>0. 

C-l-l+C 


This gives 


i £ -i»riog(n rfT > 

C -1 - 




C + i 


= 2e e2/2 E 1 (9' I /2) 

>lo §( 1 + ^)' 

where Ei(-) is the exponential integral. Thus, combining the lower and upper bounds. 


/r 2 \/2^ l0S(1 + 0 2) Phs+ ^ 0) - 7T 2 |0|‘ 


This completes the proof of Part (1). Part (2) then follows from the fact that the lower bound goes 
to oo as 9 goes to zero. In comparison, horseshoe has bounds 


\ lo S( 1 + ^) < Phs(O) <Xlog(l + ^), 

with K = 1/(27T 3 ) 1 / 2 . Both the upper and lower bounds are sharper compared to horseshoe+. 
However, the bounds for horseshoe+ can be sharpened using better approximations for log(£) / (£ — 
1 ), for which an infinite product representation is given by 


log£ 2 2 ~ 2 

C-i i + VjN + vTr*'" E[i + f 1/2r 


A.2 Proof of Theorem 14.21 

We write the posterior density of k, given y, in r scale as follows: 

p(K f |yi,r)«(l-JCi) _ 2exp(-K f ^|-^ l lo s{ K * T /(! K ')} 


(?Cj(T 2 + 1 ) — 1 ) 


Now, we use the inequality 1 — 1 / jc < log(x) < z — 1 for x > 0 to derive the following bounds 
for the Jacobian term: 


1 - 


1 ~Kj 
Kj T 2 


1 

KjT 2 


< I log {nt 2 /(1 - Ki)} | < 

|log{Nr 2 /(l- Kl )} | 1 

|(k ! (t 2 + 1 )- 1 )| 1 -k,-' 
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First note that P(7q < e\yi,r) < P(k, < e|y;, T)/P(jq > e|y,,r). Moreover, we can bound the ratio 
P(Kj < e|y,-, t)/P(k; > e|y;, r) as follows: 


P(jCi <e|y;,r) _ /o^ 1 -^) iex p{ 

-«,4} 

|log{/C;T 2 /(l-K;)}| ^ 
|(k ; (t 2 +1)-1)| 

P(k ; > e|y,-,T) £(!_*.)-! exp j 

^ Jo ~ K i )~^ ex P { 

( Ail 

l ' 2 J 

-4} 

1 1 logbi'T 2 /(1 —K;)}| J 

I |( K; (t 2 +1)-1)| fl7v ' 

• (1 - Ki) _1 dJCi 

fe( 1 ~ K i] 2 exp {-K/f 

^ foO--Ki)~ 3/2 d*i 

| {KiT 2 )- X dKi 


ex p{-f *k dK i 

lA/l-e 


The final step follows from the penultimate step by bounding the integral by the extreme values 
of the integrand multiplied by the length of the interval of integration. 


A.3 Proof of Theorem 14.31 

We need the following inequality proved in Appendix |A.2| 

1 |log{x ; r 2 /(l -Kj)} | 1 

K/T 2 |(k/(t 2 + 1) -1 )| l-Ki 

To use the probability concentration inequality in Theorem |4.2| to prove the bound on type-I error 
rate, we need to establish the following fact: 


E (Ki|y/,T) =P(K f > ^|yi,r)(l + o(l)). 


We will prove this in two steps. We first show that the posterior mean can be well approximated 
by evaluating the integral from | to 1, i.e. 

E(K,iyi,r)= f Kip(Ki\y ir T)dKi(l+o(l)) r 
^ 2 

as r —> 0. In the next step, we prove that 



K i p(K i \y i/ T)dKi 


f p(Ki\yi,T)dKi(l + o(l)), 


as t —> 0. First note that: 


1 yf 1 j 

fo 2 K i p(K i \y i ,T)dKi ei f 0 2 k ,-( 1 - k ,-)~ z (1 - K^dKj 

jlKip(Ki\yi,T)dKi - fi Kj(l - Kj)-I ^dKj 
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Thus, 


K 2 
= e 2 t 


f 0 2 Ki(l-Kj) 3/2 dKj 
fl( 1 - Kj)-2dKi 


< (3 — 2\/X)ti t 2 . 


E(K f |y f/ r) = K i p(Ki\y i/ T)dKi + j K/p( jc*-| y f ,T)dx f = ^ Kfp(K;|y,-,r)dK f (l + o(l)) 


as r —>• 0. Next, note that 


//(l -Kf)p(jCi|y f ,T)dKf C 2 /i(l-Kf) 2 ^K,- 

d%i 


fi p(Ki\yi,r)dKi /hi-*) 


.I i 

2 —n 


< e 2 r 2 


Thus, we have K;p(K,|i/,, r)dK z - = fi p(Kj|y;,T)dKj(l — o(l)). The asymptotic expression for the 
type-I error rate is then calculated as 


P Ho ^E(jCj|y;,T) < 0 =P Ho p(K,-|yi, r)dxi < 0 (l+o(l)) 


= F H n 


P( K /|yf,r)dJCi > - (l+o(l)). 


Applying Theorem 


4.2 


fore = |, we have f 0 2 p(Ki\yi, r)dK{ < (2e 2 r 2 )(l + o(l)). 


Ph 0 


^E(jc f |y f ,T) < 0 < Py i ^(o,i)(2e * r 2 > ^)(1 +o(l)) 
= P(|r ! | >2 v /log(l/2r))(l + o(l)) 


^(2y/lQg(l/2T)) 
2 A /log(l/2r) 
[2 r 2 
V 7T y / log(l/2r) 


(1 + ^( 1 )) 
(l+o(l)). 


A.4 Proof of Theorem 14.41 

First note the following identity: 


1 ~ Kj 
KjT 2 


f (l + r 2 )(l -iq) w 1 ) 

I T 2 (1 + T 2 )JC; J 

{(1 + (1 + + (1 T ~ *'> - T ' ) x (1 + 1/(1 + + - + } 
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w, . (1 -Ki(l + r 2 )) (l-jq(l + T 2 )). 

- U 1 + ^2 ) x U + K .(l + T 2) . 

1/( ' 1+ ;c r A*' > — 1, we get the following upper bound to the 
Jacobian term of the horseshoe+ posterior density: 


Since both d+ T H 1 K t) r > and 


|log(US)| _ I log (l + log (i + 


|1 - K{( 1 + T 2 ) 


|1 - Kj( 1 + T 2 ) 


1 

< — 


Kf( 1 + T 2 )' 


(A.l) 


Also, for k, < 1/ (1 + r 2 ), we would have log (Ayr) > 1 ) . Now we use the upper bound in 

Equation ( A.lJ and the above lower bound to derive an upper bound to the tail probability of the 
posterior distribution of k, given r and y,. We will use the fact that 

P(Ki > rj\\ h,t) < P(Ki > rj\yi,r)/V{Ki < rj\y ir x) < P(jc,- > tj\y if t)/P(n < 77%;, r), 

for some fraction 5 E (0,1), such that yS < 1/(1 + r 2 ). Thus, 


f 1 1 exn /_ k —\ 1108< ~ K ‘ 

h; dT=A ex P\ K * 2 / |i-K f (i 


P(k ; - > 7 |r,yi) < 


-Ki(l+T- 


rn 


/( 


tlS i 

o T7T^ ex P 


_ k .A\ i 

Kt 2 / |1 




rffC; 


< e 




-^■(l + T 2 ,, 
-S)4- ^ W-K/ (t 2 K;(H-T 2 )) 


fc 


t]S l 1 

0 Tl-jC; 1-K, 


djC; 


< g-y(i-*)£ 


^/l—1/ arctanh(^/l — tj) 
T 2 + (1 + T 2 ) 


{l/ v /wA/d-l} 

< 1 {y/l-?/ + arctanh(7l^)} 

T 2 {1/yi^-l} 


= e 


■iMfi 


ClJlJ), 


where. 


cM) = 


{ V 7 ! - 7 + arctanh( v /l — rj)} 


{1/^l^yS-l} 

is independent of yi- 

A.5 Proof of Theorem 14.51 

The probability of type-II error for any global-local shrinkage rule is given by: 

f2=P Yi ~/ HA (y) (EM yi^) > ^ • 
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We note that, from Assumption [T] and the aforementioned choice of r, we have log(l/r 2 ) / if) 2 —? 
C G ( 0 ,oo), where C is the threshold appearing in the expression for the risk of Bayes oracle (by 
Equation ( |4.3[ )). First note that, for any // e (o,i), 

Ki < I (Kf G (17,1)) + 17 , 

E(Ki|y f ,r) < P(jc f > i/|yi,T) + ?/. 

Then the type-II error rate can be written as: 

h = P Yi ~W(o,i+^) ^E(Kf|yi,T) > ^ 

< Ey f ~^(0,l+^) ^ F [K;|y,-,T] ( K i > v\Vi’ T ) >\-'1 

We use the upper bound P(jq > rj\yi,r) < exp(— t](l — 8)Yf /2)\C(t],5) from the concentration 
inequality in Theorem |4.4|to derive the following: 


h < Py f ~^ ( o,i+^) (exp (-'/(l - S)Yf /2)^C{rj,6) 

< Py,^(o,i+^) (V < ?/(1 2 _^ | lo s(y^) + lo §(^) j) 

- p z,~W(o,i) ^l z i| < S ^2 ^ as n °°- 

Under the assumption r = 0(}i), we have 1 ° S ^ 1 ,(' T - —> C as n —> 00 , where C is the constant 
appearing in the Bayes risk for the oracle (by Equation (|4.3|>). Thus, we have: 


h < 20( 1 


V (l-5) 


VC)-1 (l+o(l)) 


as n 


00 . 


A.6 Proof of Theorem 14.61 

From Appendix A.l[ the marginal prior, Phs+ ($), f° r horseshoe+ is given by the convolution 

1 


1 f 

VHS+ ^ = V 2 tz 5 / 2 Jo 


C-l 


(A.2) 


m,n 

M 


We can use Meijer G-function convolution identities (Mathai et al.. 2009) to provide a special func 
tion representation of such distributions. Let G 


denote the Meijer-G function. This class 
of functions satisfies a very useful convolution identity with many applications in Bayesian com¬ 
putation. We have the integral identity 


f°° r m,n( 3 P Ca I „rWr - -C n +} l ' m + v ( “ 6 i ~ b " 

Jo L ’ p,q V b ‘/ 1 1 X ) Lr ‘ T ' T (a I CVX ) dX — r j Uj+U.p+T I 




q+cr,p+T l -a 1 ,...,-a n/ d T ,-a n+1/ ..., 


17 























_ 1 ,“-Anr d T/ tl n +lf‘/flp 

~ ^^p+M+cr \b 1/ ...,b m ,-c lT ,b m+1 ,...,b cj 


Furthermore, when one of the G-functions is an exponential function, we obtain a general form 
for the Laplace transform of a G-function: 


-coxr' m,n I a P 


M 


(b’\yx)dx = co \t]CV X ) 


for Re(s) > 0. This can be used to calculate the implied prior via the following identity: 


lo g(*) _ r 2,2 { o 


X — 1 - G T 2 V 0 


x), 


for appropriate x. Equation ( A.2| is then 

Phs+(^) 


Gf 


V2/r 5 / 2 3 ' 2 V W 


2,3 / 1 , 1,1 
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This identity allows us to develop the asymptotic behavior around 9 —> 0 and oo through power- 
logarithmic series expansions for G-functions (Kilbas and Saigo 1999 Mathai et al. 2009j ). The 
dominant terms in a power series expansion around oo are 


„co ff)) = 1 21 og( 0 ) + 7 — log( 2 ) 

P HS +{V) ^ 2 tt 5 / 2 0 2 

where 7 is the Euler-Mascheroni constant, and similarly around zero 


(A.3) 


Phs+(^) — 


24^2/r 5 / 2 


241og z ( - ) + 24 log (2) log ( - 


—247 log 


67 2 + 57 r 2 + 61og z (2) — 127 log( 2 ) I . (A.4) 


The corresponding results for the horseshoe prior are obtained via the substitution u = 1 / A 2 (as 
Carvalho et al. (2010)) and the identity 


m 


1 , 

- e 2 an 


Phs(6) / 

JO ± ~r u 

= Glf( Y| 2 F“ 2 ), 

from which we obtain respectively around 00 and 0 that 

y/2 1 


Phs(0) = 


TV 


3/2 Q2' 


and 


Phs(^) — 


x/ 2 tt 3/ 2 


21 °g ( 0 ) — 7 + l°g (2) ) • 


(A.5) 


(A. 6 ) 
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The behavior of the Bayes risk depends crucially on the amount of mass the prior places around 
the origin. Using Equation (A.4|l and integrating (on one side) around zero. 


[ " P°HS+(®)dO 

J u 


24 V2tz 5 / 2 y/Ti ( 61 o S( n )( lo §( n ) _2 ^ + 4 + lo §( 4 )) +6 ^ 2 + 57r2 


+ 


6 (8 + log 2 (2) +log(16)) - 12 7 (2+ log(2))) . 


Collecting the higher order terms in n we get 


log 2 {n) 


\/2n 5/2 ^/n 


+ 1 - 


7 log (4) 


) log(w) +0(1) j . 


Similarly, using Equation (|A .6 1 we have 


P°Hs(®)dO — 


/log(n) 


\/ 2 TZ 3/2 ^/n V 2 

completing the proof of Theorem |4.6[ 

A.7 Proof of Theorem 14.71 

Using Equation ( |4.9| in Equation ( |4.7| for large values of jy, | yields, 

E HS+(Oi\yi ) = yi + ^-[logm HS+ (yi)} 


+ 0 ( 1 ) , 




Thus, 


Bias H s+(0/|yi) = Bias H s(0;|i/;) 


-O 


Ti 


U'loglU 

Continuing the calculation for variances we have (by Equation ( |4.10| l and ( |4.8| ), 

d 2 

d y} 


v HS+{Oi\yi ) = I + X2N m HS+ (yi)] 


= V Hs(9i\yi ) - 


1 +l°s |y,-| 


°Ut 


(y/ !og ly/l ) 2 

Thus, comparing the posterior MSE of the two estimators for large values of y we have 

MSE H s+(0i|yf) = ^s 2 HS+ (6i\yi)+V H s+{di\yi) 


= (Bias H s(0i|yi) + 


= MSE HS (0/|y«)+2 


y/iogly. 
Bias H s(0;|y;) 


o(i/yf)r + v HS (0/|yi)- 


1 + !og ly/l 
(y/!og lyd ) 2 


of 4 


y+°gN y?i°g|yi 


+ 0 


yf iog lyd 


0 A 
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= MSE HS (0/|y«)- 


Vi lo glU 


of 4 


where the last line follows from the fact that Bias hs(^; I Vi) = 0(1/y?) (from Theorem 3 in 
et al.j(20To|). 


Carvalho 
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